Version: 25.10 (Latest)

Predefined Content Identifier Rules

Cyberhaven provides an extensive library of predefined Content Identifier Rules that form the foundation of the content inspection system. These rules are built using the Nucleuz Classification Engine and are designed to detect specific data types, patterns, and sensitive information with high accuracy and performance.

Overview

The Content matching rules page under Preferences provides a unified interface for you to define content inspection rules.

On the Rules tab, you can:

  • View all predefined and custom rules
  • Create new custom rules
  • Delete custom rules
  • Enable or disable predefined and custom rules

Note

    Predefined rules cannot be deleted.

    Rules that are currently applied in a policy or dataset cannot be deleted.

Predefined Rules Library

Cyberhaven includes a comprehensive library of predefined classification rules that detect sensitive data types commonly found in organizations worldwide. These rules are organized into logical groupings and cover various data protection requirements, regulatory compliance needs, and industry-specific data types.

Rule Categories

The predefined rules are organized into several high-level categories:

Personal Identifiers

  • Social Security Numbers: Various national formats with validation
  • National ID Numbers: Country-specific identification formats
  • Passport Numbers: International passport number patterns
  • Driver License Numbers: Regional license number formats
  • Tax Identification Numbers: Country-specific tax ID patterns

Financial Data

  • Credit Card Numbers: Multiple card types with Luhn algorithm validation
  • Bank Account Numbers: Various national and international formats
  • IBAN Numbers: International Bank Account Number validation
  • Routing Numbers: Banking routing and transit numbers
  • SWIFT Codes: Bank identifier codes

Healthcare Data

  • Medical Record Numbers: Healthcare identifier formats
  • Drug Enforcement Administration (DEA) Numbers: US registration numbers for prescribers of controlled substances
  • National Provider Identifiers: Healthcare provider IDs
  • Health Insurance Numbers: Medical insurance identifiers
  • Patient Identifiers: Various healthcare system IDs

Communication Data

  • Email Addresses: Various email format patterns
  • Phone Numbers: International and domestic phone formats
  • IP Addresses: IPv4 and IPv6 address patterns
  • URLs and Domains: Web address patterns
  • MAC Addresses: Network hardware identifiers

Government and Legal Data

  • Voter Registration Numbers: Electoral system identifiers
  • Court Case Numbers: Legal system case identifiers
  • License Numbers: Professional and business licenses
  • Permit Numbers: Government permit identifiers
  • Registration Numbers: Various government registrations

Authentication and Security

  • API Keys: Various API key patterns
  • Access Tokens: Authentication token formats
  • Passwords: Password pattern detection
  • Cryptographic Keys: Encryption key patterns
  • Certificates: Digital certificate identifiers

Rule Structure and Components

Each predefined rule contains multiple components that work together to accurately detect sensitive data:

Pattern Matching

  • Regular Expressions: Sophisticated regex patterns for format detection
  • Format Validation: Specific formatting requirements (e.g., XXX-XX-XXXX)
  • Length Constraints: Minimum and maximum character limits
  • Character Sets: Allowed characters and encoding requirements

Validation Functions

  • Checksum Algorithms: Mathematical validation (e.g., Luhn algorithm for credit cards)
  • Format Verification: Structural validation of data patterns
  • Range Validation: Numeric range checking where applicable
  • Cross-Reference Validation: Verification against known valid patterns

Context Analysis

  • Supporting Keywords: Contextual terms that increase confidence
  • Proximity Analysis: Related terms within specified distance
  • Document Structure: Location-based context (headers, forms, etc.)
  • Language Support: Multilingual keyword recognition
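
The proximity idea above can be illustrated with a minimal sketch. It is not Cyberhaven's implementation: the function name `has_keyword_nearby`, the character-based `window`, and the keyword list are all assumptions made for illustration.

```python
import re

def has_keyword_nearby(text: str, start: int, end: int,
                       keywords: list[str], window: int = 40) -> bool:
    """Return True if any keyword appears within `window` characters
    before or after the matched span text[start:end].

    The window size and the use of characters (rather than tokens)
    are illustrative assumptions.
    """
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    # Join with a space so a keyword cannot be formed across the gap.
    context = (before + " " + after).lower()
    return any(
        re.search(r"\b" + re.escape(keyword.lower()) + r"\b", context)
        for keyword in keywords
    )
```

For example, given a match on `123-45-6789` in the text "Employee SSN: 123-45-6789 on file", the keyword "ssn" falls inside the window, so the match would receive a context boost in a scheme like this one.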

Confidence Scoring

  • Base Confidence: Initial confidence based on pattern match
  • Context Boost: Additional confidence from supporting evidence
  • Validation Confirmation: Confidence increase from successful validation
  • Threshold Management: Configurable confidence thresholds
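
One way to picture how these components combine is the sketch below. The weights, the 0-100 scale, the default threshold, and the names `Evidence`, `confidence`, and `is_reported` are hypothetical; the actual scoring model is not documented here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    pattern_matched: bool    # the regex or format check found a candidate
    validation_passed: bool  # e.g. a checksum succeeded
    context_found: bool      # supporting keywords were found nearby

# Hypothetical weights on a 0-100 scale (integers avoid float rounding).
BASE_CONFIDENCE = 50
VALIDATION_BOOST = 30
CONTEXT_BOOST = 20

def confidence(evidence: Evidence) -> int:
    """Combine base confidence with validation and context boosts."""
    if not evidence.pattern_matched:
        return 0
    score = BASE_CONFIDENCE
    if evidence.validation_passed:
        score += VALIDATION_BOOST
    if evidence.context_found:
        score += CONTEXT_BOOST
    return min(score, 100)

def is_reported(evidence: Evidence, threshold: int = 70) -> bool:
    """Report a match only when its confidence reaches the threshold."""
    return confidence(evidence) >= threshold
```

Under these assumed weights, a bare pattern match scores 50 and is not reported at the default threshold, while a match that also passes validation scores 80 and is reported. Raising the threshold trades coverage for fewer false positives.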

Rule Performance Characteristics

Detection Accuracy

  • High Precision: Minimized false positives through validation
  • Comprehensive Coverage: Multiple patterns for format variations
  • Contextual Awareness: Reduced false positives through context analysis
  • Adaptive Thresholds: Configurable sensitivity levels

Processing Efficiency

  • Optimized Patterns: Regular expressions tuned for performance
  • Parallel Processing: Rules designed for concurrent execution
  • Memory Efficiency: Optimized memory usage patterns
  • Scalable Architecture: Performance maintained at scale

Regional Adaptations

  • Localized Patterns: Country-specific data formats
  • Language Support: Multilingual keyword recognition
  • Cultural Context: Region-appropriate detection patterns
  • Regulatory Alignment: Compliance with local data protection laws

Example Rule Types

Social Security Number (US)

  • Pattern: XXX-XX-XXXX format detection
  • Validation: Area number and group number validation
  • Context: Keywords like "SSN", "Social Security", "Tax ID"
  • Confidence: High confidence with validation, medium without
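
As a rough illustration of the pattern and validation steps, the sketch below matches the XXX-XX-XXXX format and then applies the publicly documented structural rules for US Social Security numbers. The regular expression and function names are assumptions for illustration, not the predefined rule's actual definition.

```python
import re

# Illustrative pattern only: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def is_plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Reject values that are never issued as SSNs."""
    if area in ("000", "666") or area.startswith("9"):
        return False  # area numbers 000, 666, and 900-999 are not assigned
    if group == "00" or serial == "0000":
        return False  # all-zero group or serial numbers are not assigned
    return True

def find_ssns(text: str) -> list[str]:
    """Return formatted candidates that pass the structural checks."""
    return [
        match.group(0)
        for match in SSN_PATTERN.finditer(text)
        if is_plausible_ssn(*match.groups())
    ]
```

A candidate such as `000-12-3456` matches the format but fails area validation, which is the kind of case where validation reduces false positives.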

Credit Card Numbers

  • Pattern: 13-19 digit sequences with optional separators
  • Validation: Luhn algorithm checksum verification
  • Context: Keywords like "card", "credit", "payment"
  • Types: Visa, MasterCard, American Express, Discover, etc.
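
The Luhn check mentioned above is a standard public algorithm and can be sketched as follows. The candidate pattern (13 to 19 digits with optional single space or hyphen separators) and the function names are illustrative assumptions, not the product's rule definition.

```python
import re

# Illustrative candidate pattern: 13-19 digits, optional " " or "-" between digits.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right,
    subtract 9 from results above 9, and require the sum to end in 0."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return separator-free digit strings that pass the Luhn check."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group(0))
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The widely used test number `4111 1111 1111 1111` passes this check, while changing its last digit to 2 makes it fail. A production rule would likely also check issuer prefixes and per-brand lengths, which this sketch omits.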

Email Addresses

  • Pattern: Local@domain format with RFC compliance
  • Validation: Domain structure and character validation
  • Context: Communication-related keywords
  • Variations: Multiple format variations and international domains
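
A simplified sketch of pattern matching followed by domain structure validation is shown below. It covers only a common ASCII subset of what the RFCs allow (no quoted local parts or internationalized domains), and the regular expression and function names are assumptions for illustration.

```python
import re

# Simplified candidate pattern: unquoted ASCII local part, then "@", then a domain.
EMAIL_CANDIDATE = re.compile(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+@[A-Za-z0-9.-]+")

def domain_valid(domain: str) -> bool:
    """Check basic DNS structure: at least two labels, each 1-63 characters,
    no label starting or ending with a hyphen, total length at most 253."""
    labels = domain.split(".")
    if len(labels) < 2 or len(domain) > 253:
        return False
    for label in labels:
        if not 1 <= len(label) <= 63:
            return False
        if label.startswith("-") or label.endswith("-"):
            return False
    return True

def find_emails(text: str) -> list[str]:
    """Return candidates whose local part and domain pass the checks."""
    hits = []
    for match in EMAIL_CANDIDATE.finditer(text):
        candidate = match.group(0).rstrip(".")  # drop sentence-ending periods
        local, _, domain = candidate.rpartition("@")
        if local.startswith(".") or local.endswith(".") or ".." in local:
            continue  # dots may not lead, trail, or repeat in an unquoted local part
        if domain_valid(domain):
            hits.append(candidate)
    return hits
```

In this sketch, `bob@localhost` is rejected because the domain has a single label, and `a..b@example.com` is rejected because of the consecutive dots in the local part.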

IBAN Numbers

  • Pattern: Country code + check digits + account identifier
  • Validation: MOD-97 checksum algorithm
  • Context: Banking and financial keywords
  • Coverage: All IBAN-participating countries
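
The MOD-97 check can be sketched as follows: move the first four characters to the end, convert letters to numbers (A=10 through Z=35), and confirm that the resulting integer leaves a remainder of 1 when divided by 97. The pattern and function name are illustrative assumptions, and the per-country length check a full rule would apply is omitted.

```python
import re

# Illustrative shape check: country code, two check digits, 11-30 more characters.
IBAN_SHAPE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")

def iban_valid(iban: str) -> bool:
    """Validate an IBAN with the MOD-97 checksum (country lengths not checked)."""
    compact = iban.replace(" ", "").upper()
    if not IBAN_SHAPE.fullmatch(compact):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # int(char, 36) maps 0-9 to themselves and A-Z to 10-35.
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1
```

The commonly cited example `GB82 WEST 1234 5698 7654 32` passes this check, and altering any single digit causes it to fail.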

Rule Management

Enabling Rules

Enable the predefined and custom rules you want to use for content inspection. Cyberhaven's content inspection engines will analyze content using the enabled rules to identify sensitive data patterns.

Rule Limitations

Note

    You cannot disable rules currently in use within a policy or dataset.

    The total number of rules that can be enabled at the same time is limited. The limit depends on system resources and performance requirements.

Selection Guidelines

When selecting predefined rules:

  1. Data Relevance: Choose rules that match the types of sensitive data in your environment
  2. Regional Requirements: Select rules appropriate for your geographic regions
  3. Regulatory Compliance: Include rules required for applicable compliance frameworks
  4. Performance Impact: Consider the cumulative processing overhead of enabled rules
  5. Accuracy Requirements: Balance comprehensive coverage with acceptable false positive rates

Policy Association

Predefined rules are used within Content Identifier Policies to:

  • Define Detection Scope: Specify which data types to detect
  • Set Confidence Thresholds: Configure sensitivity levels
  • Combine Multiple Rules: Create comprehensive detection policies
  • Enable Contextual Detection: Leverage supporting evidence

Performance Considerations

Resource Usage

  • CPU Impact: Processing overhead varies by rule complexity
  • Memory Requirements: Rules consume system memory during execution
  • I/O Considerations: Content scanning affects storage and network performance
  • Scalability: Performance impact scales with content volume and rule count

Optimization Strategies

  • Selective Enablement: Enable only necessary rules for your environment
  • Threshold Tuning: Adjust confidence thresholds to balance accuracy and performance
  • Rule Prioritization: Focus on high-value data types first
  • Performance Monitoring: Track system performance with different rule configurations